Instructions

The following tutorial will let you reproduce the plots that we are going to create at the workshop using R.
Please read carefully and follow the steps. Wherever you see the Code icon on the right you can click on it to see the actual code used in that section.

Introduction

This tutorial will focus on analysing the updated data of the worldwide Novel Corona virus (COVID-19) pandemic.
There are several data sources available online. We will use the data collected from a range of official sources and hosted on the Our World in Data website (Mathieu et al. 2021).

Analyse Data in R

To run R and RStudio on Binder, click on this badge - Launch Rstudio Binder.

Start RStudio and create a new project named Workshop3 in a new folder (if you need a reminder ho to do it, check out Workshop1 Tutorial on BB).
Once RStudio restarts inside the project’s folder, create a new R script named Workshop3.R and 2 new folders, one named data for our input data and another named output for our plots.

Install Extra Packages

For this analysis we will again use some packages from the Tidyverse, but this time we load the specific packages (which are supposed to be pre-installed on your computers) to try and avoid having to download the entire tidyverse. In addition to the Tidyverse packages we’ve got to know in the previous workshop we will use the plotly package to create interactive plots, paletteer for custom color palettes, readxl to read MS-Excel file, scales to format large numbers, lubridate to better handle dates, glue to paste together strings, patchwork to include several plots in a single figure and a few others to assist in getting the data into shape.
To install these packages, we will introduce a package called pacman that will assist in loading the required packages and installing them if they’re not already installed. To install it we use the install.packages('pacman') command, please note that the package name need to be quoted and that we only need to perform it once, or when we want or need to update the package. Once the package was installed, we can load its functions using the library(pacman) command and then load/install all the other packages at once with p_load() function.

# install required packages - needed only once! (comment with a # after first use)
install.packages('pacman')
# load required packages
library(pacman)
p_load(tidyverse,  paletteer, glue, scales, plotly, lubridate, patchwork, visdat)

More information on installing and using R packages can be found in this tutorial.

Read Data

Now that we’ve got RStudio up and running and our packages installed and loaded, we can read data into R from our local computer or from web locations using dedicated functions specific to the file type (.csv, .txt, .xlsx, etc.).

We will use the read_csv() command/function from the readr package (part of the tidyverse) to load the data directly from a file on the Our World in Data website into a variable of type data frame (table). If we don’t want to use external packages, we can use the read.csv() function from base R, which won’t automatically parse columns containing dates and in previous versions of R (< 4.0) will slightly change the structure of the resulting data frame (all text columns will be converted into factors).
> Note that in this case, we need to specify the column types because the data contains a lot of missing values that interfere with the automatic parsing of the column types._

# read data from Our World in Data website
covid_data <- read_csv("https://covid.ourworldindata.org/data/owid-covid-data.csv", 
                       col_types = paste(c("c", "f", "c", "D", rep("d", 29), 
                                           "c", rep("d", 33)), collapse = ""))

Data Exploration

Let’s use built-in functions for a brief data exploration, such as head() to show the first 10 rows of the data and str() for the type of data in each column (see detailed information on each variable in the data repository on GitHub):

#explore the data frame
head(covid_data) # show first 10 rows of the data and typr of variables
## # A tibble: 6 x 67
##   iso_code continent location  date       total_cases new_cases new_cases_smoot~
##   <chr>    <fct>     <chr>     <date>           <dbl>     <dbl>            <dbl>
## 1 AFG      Asia      Afghanis~ 2020-02-24           5         5               NA
## 2 AFG      Asia      Afghanis~ 2020-02-25           5         0               NA
## 3 AFG      Asia      Afghanis~ 2020-02-26           5         0               NA
## 4 AFG      Asia      Afghanis~ 2020-02-27           5         0               NA
## 5 AFG      Asia      Afghanis~ 2020-02-28           5         0               NA
## 6 AFG      Asia      Afghanis~ 2020-02-29           5         0               NA
## # ... with 60 more variables: total_deaths <dbl>, new_deaths <dbl>,
## #   new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## #   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## #   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## #   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## #   icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, ...
str(covid_data) # show data structure
## spec_tbl_df [175,986 x 67] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ iso_code                                  : chr [1:175986] "AFG" "AFG" "AFG" "AFG" ...
##  $ continent                                 : Factor w/ 6 levels "Asia","Europe",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ location                                  : chr [1:175986] "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ date                                      : Date[1:175986], format: "2020-02-24" "2020-02-25" ...
##  $ total_cases                               : num [1:175986] 5 5 5 5 5 5 5 5 5 5 ...
##  $ new_cases                                 : num [1:175986] 5 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed                        : num [1:175986] NA NA NA NA NA NA 0.714 0 0 0 ...
##  $ total_deaths                              : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths                                : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_smoothed                       : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_cases_per_million                   : num [1:175986] 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 0.126 ...
##  $ new_cases_per_million                     : num [1:175986] 0.126 0 0 0 0 0 0 0 0 0 ...
##  $ new_cases_smoothed_per_million            : num [1:175986] NA NA NA NA NA NA 0.018 0 0 0 ...
##  $ total_deaths_per_million                  : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_per_million                    : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_deaths_smoothed_per_million           : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ reproduction_rate                         : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients                              : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ icu_patients_per_million                  : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients                             : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ hosp_patients_per_million                 : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions                     : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_icu_admissions_per_million         : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions                    : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ weekly_hosp_admissions_per_million        : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests                               : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests                                 : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_tests_per_thousand                  : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_per_thousand                    : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed                        : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_tests_smoothed_per_thousand           : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ positive_rate                             : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_per_case                            : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ tests_units                               : chr [1:175986] NA NA NA NA ...
##  $ total_vaccinations                        : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated                         : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated                   : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters                            : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations                          : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed                 : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_vaccinations_per_hundred            : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ people_vaccinated_per_hundred             : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ people_fully_vaccinated_per_hundred       : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ total_boosters_per_hundred                : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_vaccinations_smoothed_per_million     : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed            : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ new_people_vaccinated_smoothed_per_hundred: num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ stringency_index                          : num [1:175986] 8.33 8.33 8.33 8.33 8.33 ...
##  $ population                                : num [1:175986] 39835428 39835428 39835428 39835428 39835428 ...
##  $ population_density                        : num [1:175986] 54.4 54.4 54.4 54.4 54.4 ...
##  $ median_age                                : num [1:175986] 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 18.6 ...
##  $ aged_65_older                             : num [1:175986] 2.58 2.58 2.58 2.58 2.58 ...
##  $ aged_70_older                             : num [1:175986] 1.34 1.34 1.34 1.34 1.34 ...
##  $ gdp_per_capita                            : num [1:175986] 1804 1804 1804 1804 1804 ...
##  $ extreme_poverty                           : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ cardiovasc_death_rate                     : num [1:175986] 597 597 597 597 597 ...
##  $ diabetes_prevalence                       : num [1:175986] 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 9.59 ...
##  $ female_smokers                            : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ male_smokers                              : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ handwashing_facilities                    : num [1:175986] 37.7 37.7 37.7 37.7 37.7 ...
##  $ hospital_beds_per_thousand                : num [1:175986] 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ life_expectancy                           : num [1:175986] 64.8 64.8 64.8 64.8 64.8 ...
##  $ human_development_index                   : num [1:175986] 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 0.511 ...
##  $ excess_mortality_cumulative_absolute      : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative               : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality                          : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  $ excess_mortality_cumulative_per_million   : num [1:175986] NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   iso_code = col_character(),
##   ..   continent = col_factor(levels = NULL, ordered = FALSE, include_na = FALSE),
##   ..   location = col_character(),
##   ..   date = col_date(format = ""),
##   ..   total_cases = col_double(),
##   ..   new_cases = col_double(),
##   ..   new_cases_smoothed = col_double(),
##   ..   total_deaths = col_double(),
##   ..   new_deaths = col_double(),
##   ..   new_deaths_smoothed = col_double(),
##   ..   total_cases_per_million = col_double(),
##   ..   new_cases_per_million = col_double(),
##   ..   new_cases_smoothed_per_million = col_double(),
##   ..   total_deaths_per_million = col_double(),
##   ..   new_deaths_per_million = col_double(),
##   ..   new_deaths_smoothed_per_million = col_double(),
##   ..   reproduction_rate = col_double(),
##   ..   icu_patients = col_double(),
##   ..   icu_patients_per_million = col_double(),
##   ..   hosp_patients = col_double(),
##   ..   hosp_patients_per_million = col_double(),
##   ..   weekly_icu_admissions = col_double(),
##   ..   weekly_icu_admissions_per_million = col_double(),
##   ..   weekly_hosp_admissions = col_double(),
##   ..   weekly_hosp_admissions_per_million = col_double(),
##   ..   total_tests = col_double(),
##   ..   new_tests = col_double(),
##   ..   total_tests_per_thousand = col_double(),
##   ..   new_tests_per_thousand = col_double(),
##   ..   new_tests_smoothed = col_double(),
##   ..   new_tests_smoothed_per_thousand = col_double(),
##   ..   positive_rate = col_double(),
##   ..   tests_per_case = col_double(),
##   ..   tests_units = col_character(),
##   ..   total_vaccinations = col_double(),
##   ..   people_vaccinated = col_double(),
##   ..   people_fully_vaccinated = col_double(),
##   ..   total_boosters = col_double(),
##   ..   new_vaccinations = col_double(),
##   ..   new_vaccinations_smoothed = col_double(),
##   ..   total_vaccinations_per_hundred = col_double(),
##   ..   people_vaccinated_per_hundred = col_double(),
##   ..   people_fully_vaccinated_per_hundred = col_double(),
##   ..   total_boosters_per_hundred = col_double(),
##   ..   new_vaccinations_smoothed_per_million = col_double(),
##   ..   new_people_vaccinated_smoothed = col_double(),
##   ..   new_people_vaccinated_smoothed_per_hundred = col_double(),
##   ..   stringency_index = col_double(),
##   ..   population = col_double(),
##   ..   population_density = col_double(),
##   ..   median_age = col_double(),
##   ..   aged_65_older = col_double(),
##   ..   aged_70_older = col_double(),
##   ..   gdp_per_capita = col_double(),
##   ..   extreme_poverty = col_double(),
##   ..   cardiovasc_death_rate = col_double(),
##   ..   diabetes_prevalence = col_double(),
##   ..   female_smokers = col_double(),
##   ..   male_smokers = col_double(),
##   ..   handwashing_facilities = col_double(),
##   ..   hospital_beds_per_thousand = col_double(),
##   ..   life_expectancy = col_double(),
##   ..   human_development_index = col_double(),
##   ..   excess_mortality_cumulative_absolute = col_double(),
##   ..   excess_mortality_cumulative = col_double(),
##   ..   excess_mortality = col_double(),
##   ..   excess_mortality_cumulative_per_million = col_double()
##   .. )

Descriptive Statistics

We can also produce some descriptive statistics to better understand the data and the nature of each variable. The summary() function (as can be guessed by its name) provides a quick summary of basic descriptive statistics, such as the mean, min, max and quantiles for continuous numerical values.

# summary of variables in my data
summary(covid_data)
##    iso_code                 continent       location        
##  Length:175986      Asia         :37546   Length:175986     
##  Class :character   Europe       :38504   Class :character  
##  Mode  :character   Africa       :40957   Mode  :character  
##                     North America:27408                     
##                     South America: 9881                     
##                     Oceania      :11370                     
##                     NA's         :10320                     
##       date             total_cases          new_cases       new_cases_smoothed
##  Min.   :2020-01-01   Min.   :        1   Min.   :      0   Min.   :      0   
##  1st Qu.:2020-09-15   1st Qu.:     2215   1st Qu.:      1   1st Qu.:      7   
##  Median :2021-03-28   Median :    29306   Median :     79   Median :    108   
##  Mean   :2021-03-22   Mean   :  2769817   Mean   :  12308   Mean   :  12325   
##  3rd Qu.:2021-09-28   3rd Qu.:   323160   3rd Qu.:   1077   3rd Qu.:   1177   
##  Max.   :2022-04-02   Max.   :490664062   Max.   :4089078   Max.   :3436882   
##                       NA's   :6309        NA's   :6508      NA's   :8502      
##   total_deaths       new_deaths      new_deaths_smoothed
##  Min.   :      1   Min.   :    0.0   Min.   :    0.000  
##  1st Qu.:     84   1st Qu.:    0.0   1st Qu.:    0.143  
##  Median :    822   Median :    2.0   Median :    2.429  
##  Mean   :  60003   Mean   :  168.3   Mean   :  170.048  
##  3rd Qu.:   7670   3rd Qu.:   19.0   3rd Qu.:   21.143  
##  Max.   :6151255   Max.   :18156.0   Max.   :14795.857  
##  NA's   :24379     NA's   :24367     NA's   :26527      
##  total_cases_per_million new_cases_per_million new_cases_smoothed_per_million
##  Min.   :     0.0        Min.   :    0.00      Min.   :    0.000             
##  1st Qu.:   659.3        1st Qu.:    0.03      1st Qu.:    1.628             
##  Median :  5152.5        Median :   11.37      Median :   19.317             
##  Mean   : 33126.1        Mean   :  179.55      Mean   :  178.980             
##  3rd Qu.: 41583.2        3rd Qu.:  103.60      3rd Qu.:  125.521             
##  Max.   :706541.9        Max.   :51427.49      Max.   :16052.608             
##  NA's   :7095            NA's   :7294          NA's   :9282                  
##  total_deaths_per_million new_deaths_per_million
##  Min.   :   0.00          Min.   :  0.000       
##  1st Qu.:  19.99          1st Qu.:  0.000       
##  Median : 141.91          Median :  0.118       
##  Mean   : 534.91          Mean   :  1.669       
##  3rd Qu.: 764.43          3rd Qu.:  1.358       
##  Max.   :6363.99          Max.   :453.772       
##  NA's   :25152            NA's   :25140         
##  new_deaths_smoothed_per_million reproduction_rate  icu_patients    
##  Min.   :  0.000                 Min.   :-0.04     Min.   :    0.0  
##  1st Qu.:  0.016                 1st Qu.: 0.79     1st Qu.:   31.0  
##  Median :  0.290                 Median : 0.99     Median :  155.0  
##  Mean   :  1.670                 Mean   : 0.99     Mean   :  910.5  
##  3rd Qu.:  1.763                 3rd Qu.: 1.17     3rd Qu.:  600.0  
##  Max.   :144.167                 Max.   : 6.13     Max.   :28891.0  
##  NA's   :27294                   NA's   :44834     NA's   :152433   
##  icu_patients_per_million hosp_patients    hosp_patients_per_million
##  Min.   :  0.00           Min.   :     0   Min.   :   0.00          
##  1st Qu.:  4.50           1st Qu.:   140   1st Qu.:  28.81          
##  Median : 13.71           Median :   774   Median :  91.60          
##  Mean   : 24.05           Mean   :  4288   Mean   : 170.74          
##  3rd Qu.: 34.66           3rd Qu.:  2969   3rd Qu.: 232.24          
##  Max.   :177.28           Max.   :154540   Max.   :1544.08          
##  NA's   :152433           NA's   :151021   NA's   :151021           
##  weekly_icu_admissions weekly_icu_admissions_per_million weekly_hosp_admissions
##  Min.   :   0.00       Min.   :  0.00                    Min.   :     0        
##  1st Qu.:  50.25       1st Qu.:  4.09                    1st Qu.:   345        
##  Median : 218.00       Median : 10.95                    Median :  1393        
##  Mean   : 466.72       Mean   : 15.26                    Mean   :  5911        
##  3rd Qu.: 659.00       3rd Qu.: 20.12                    3rd Qu.:  5345        
##  Max.   :4838.00       Max.   :221.21                    Max.   :154002        
##  NA's   :170268        NA's   :170268                    NA's   :164589        
##  weekly_hosp_admissions_per_million  total_tests          new_tests      
##  Min.   :  0.00                     Min.   :        0   Min.   :      1  
##  1st Qu.: 24.92                     1st Qu.:   323269   1st Qu.:   2150  
##  Median : 75.36                     Median :  1821136   Median :   8673  
##  Mean   :105.08                     Mean   : 17905115   Mean   :  66788  
##  3rd Qu.:144.12                     3rd Qu.:  8888466   3rd Qu.:  36513  
##  Max.   :645.81                     Max.   :847109806   Max.   :3740296  
##  NA's   :164589                     NA's   :102386      NA's   :104844   
##  total_tests_per_thousand new_tests_per_thousand new_tests_smoothed
##  Min.   :    0.00         Min.   :  0.00         Min.   :      0   
##  1st Qu.:   38.59         1st Qu.:  0.27         1st Qu.:   1786   
##  Median :  202.58         Median :  0.94         Median :   7621   
##  Mean   :  793.00         Mean   :  3.25         Mean   :  58840   
##  3rd Qu.:  768.13         3rd Qu.:  2.91         3rd Qu.:  34008   
##  Max.   :31688.97         Max.   :534.01         Max.   :3080396   
##  NA's   :102386           NA's   :104844         NA's   :84662     
##  new_tests_smoothed_per_thousand positive_rate   tests_per_case    
##  Min.   :  0.00                  Min.   :0.00    Min.   :     1.0  
##  1st Qu.:  0.23                  1st Qu.:0.02    1st Qu.:     7.1  
##  Median :  0.90                  Median :0.06    Median :    17.0  
##  Mean   :  2.89                  Mean   :0.10    Mean   :   147.6  
##  3rd Qu.:  2.65                  3rd Qu.:0.14    3rd Qu.:    51.0  
##  Max.   :147.60                  Max.   :0.99    Max.   :189932.0  
##  NA's   :84662                   NA's   :91217   NA's   :92085     
##  tests_units        total_vaccinations  people_vaccinated  
##  Length:175986      Min.   :0.000e+00   Min.   :0.000e+00  
##  Class :character   1st Qu.:6.521e+05   1st Qu.:4.046e+05  
##  Mode  :character   Median :5.194e+06   Median :3.195e+06  
##                     Mean   :1.875e+08   Mean   :9.453e+07  
##                     3rd Qu.:3.308e+07   3rd Qu.:1.890e+07  
##                     Max.   :1.129e+10   Max.   :5.079e+09  
##                     NA's   :128425      NA's   :130772     
##  people_fully_vaccinated total_boosters      new_vaccinations  
##  Min.   :1.000e+00       Min.   :1.000e+00   Min.   :       0  
##  1st Qu.:2.942e+05       1st Qu.:3.269e+03   1st Qu.:    6005  
##  Median :2.553e+06       Median :5.558e+05   Median :   40663  
##  Mean   :7.584e+07       Mean   :2.408e+07   Mean   : 1158698  
##  3rd Qu.:1.495e+07       3rd Qu.:5.046e+06   3rd Qu.:  276462  
##  Max.   :4.565e+09       Max.   :1.652e+09   Max.   :54498879  
##  NA's   :133413          NA's   :155893      NA's   :136672    
##  new_vaccinations_smoothed total_vaccinations_per_hundred
##  Min.   :       0          Min.   :  0.00                
##  1st Qu.:     992          1st Qu.: 13.54                
##  Median :    8918          Median : 63.00                
##  Mean   :  508364          Mean   : 77.09                
##  3rd Qu.:   64254          3rd Qu.:129.33                
##  Max.   :43567880          Max.   :348.27                
##  NA's   :85734             NA's   :128425                
##  people_vaccinated_per_hundred people_fully_vaccinated_per_hundred
##  Min.   :  0.00                Min.   :  0.00                     
##  1st Qu.:  9.50                1st Qu.:  5.73                     
##  Median : 38.82                Median : 29.47                     
##  Mean   : 39.32                Mean   : 34.07                     
##  3rd Qu.: 66.15                3rd Qu.: 60.04                     
##  Max.   :124.77                Max.   :122.54                     
##  NA's   :130772                NA's   :133413                     
##  total_boosters_per_hundred new_vaccinations_smoothed_per_million
##  Min.   :  0.00             Min.   :     0                       
##  1st Qu.:  0.02             1st Qu.:   636                       
##  Median :  4.24             Median :  2071                       
##  Mean   : 14.24             Mean   :  3217                       
##  3rd Qu.: 25.25             3rd Qu.:  4598                       
##  Max.   :100.96             Max.   :117497                       
##  NA's   :155893             NA's   :85734                        
##  new_people_vaccinated_smoothed new_people_vaccinated_smoothed_per_hundred
##  Min.   :       0               Min.   : 0.00                             
##  1st Qu.:     386               1st Qu.: 0.02                             
##  Median :    3663               Median : 0.07                             
##  Mean   :  202883               Mean   : 0.14                             
##  3rd Qu.:   25216               3rd Qu.: 0.18                             
##  Max.   :21377378               Max.   :11.75                             
##  NA's   :86606                  NA's   :86606                             
##  stringency_index   population        population_density    median_age   
##  Min.   :  0.00   Min.   :4.700e+01   Min.   :    0.137   Min.   :15.10  
##  1st Qu.: 39.81   1st Qu.:9.029e+05   1st Qu.:   37.312   1st Qu.:22.30  
##  Median : 54.17   Median :7.553e+06   Median :   87.324   Median :30.60  
##  Mean   : 54.06   Mean   :1.444e+08   Mean   :  460.409   Mean   :30.66  
##  3rd Qu.: 70.37   3rd Qu.:3.278e+07   3rd Qu.:  214.243   3rd Qu.:39.10  
##  Max.   :100.00   Max.   :7.875e+09   Max.   :20546.766   Max.   :48.20  
##  NA's   :38448    NA's   :1103        NA's   :19223       NA's   :30586  
##  aged_65_older   aged_70_older    gdp_per_capita     extreme_poverty
##  Min.   : 1.14   Min.   : 0.526   Min.   :   661.2   Min.   : 0.10  
##  1st Qu.: 3.53   1st Qu.: 2.063   1st Qu.:  4449.9   1st Qu.: 0.60  
##  Median : 6.70   Median : 4.209   Median : 13111.2   Median : 2.20  
##  Mean   : 8.85   Mean   : 5.580   Mean   : 19687.7   Mean   :13.59  
##  3rd Qu.:14.31   3rd Qu.: 9.167   3rd Qu.: 27936.9   3rd Qu.:21.20  
##  Max.   :27.05   Max.   :18.493   Max.   :116935.6   Max.   :77.60  
##  NA's   :32136   NA's   :31353    NA's   :31109      NA's   :81276  
##  cardiovasc_death_rate diabetes_prevalence female_smokers   male_smokers  
##  Min.   : 79.37        Min.   : 0.990      Min.   : 0.10   Min.   : 7.70  
##  1st Qu.:168.71        1st Qu.: 5.350      1st Qu.: 1.90   1st Qu.:21.60  
##  Median :243.81        Median : 7.200      Median : 6.30   Median :31.40  
##  Mean   :259.95        Mean   : 8.355      Mean   :10.64   Mean   :32.78  
##  3rd Qu.:329.94        3rd Qu.:10.590      3rd Qu.:19.30   3rd Qu.:41.30  
##  Max.   :724.42        Max.   :30.530      Max.   :44.00   Max.   :78.10  
##  NA's   :30883         NA's   :23980       NA's   :65791   NA's   :67302  
##  handwashing_facilities hospital_beds_per_thousand life_expectancy
##  Min.   :  1.19         Min.   : 0.10              Min.   :53.28  
##  1st Qu.: 20.86         1st Qu.: 1.30              1st Qu.:69.59  
##  Median : 49.84         Median : 2.40              Median :75.09  
##  Mean   : 50.85         Mean   : 3.03              Mean   :73.67  
##  3rd Qu.: 82.50         3rd Qu.: 4.00              3rd Qu.:79.19  
##  Max.   :100.00         Max.   :13.80              Max.   :86.75  
##  NA's   :104581         NA's   :47324              NA's   :11466  
##  human_development_index excess_mortality_cumulative_absolute
##  Min.   :0.39            Min.   : -37726                     
##  1st Qu.:0.60            1st Qu.:    -45                     
##  Median :0.74            Median :   3526                     
##  Mean   :0.73            Mean   :  38593                     
##  3rd Qu.:0.84            3rd Qu.:  26111                     
##  Max.   :0.96            Max.   :1111865                     
##  NA's   :34265           NA's   :169963                      
##  excess_mortality_cumulative excess_mortality
##  Min.   :-28.45              Min.   :-95.92  
##  1st Qu.: -0.46              1st Qu.: -0.58  
##  Median :  6.29              Median :  7.42  
##  Mean   :  9.56              Mean   : 15.84  
##  3rd Qu.: 14.61              3rd Qu.: 22.48  
##  Max.   :111.01              Max.   :375.00  
##  NA's   :169963              NA's   :169963  
##  excess_mortality_cumulative_per_million
##  Min.   :-1826.60                       
##  1st Qu.:  -19.13                       
##  Median :  508.82                       
##  Mean   : 1023.02                       
##  3rd Qu.: 1690.75                       
##  Max.   : 9339.47                       
##  NA's   :169963
# find extreme rows
covid_data %>% arrange(desc(total_cases))
## # A tibble: 175,986 x 67
##    iso_code continent location date       total_cases new_cases new_cases_smoot~
##    <chr>    <fct>     <chr>    <date>           <dbl>     <dbl>            <dbl>
##  1 OWID_WRL <NA>      World    2022-04-02   490664062   1055332         1406584.
##  2 OWID_WRL <NA>      World    2022-04-01   489608730   1170974         1449590.
##  3 OWID_WRL <NA>      World    2022-03-31   488437812   1874508         1545464.
##  4 OWID_WRL <NA>      World    2022-03-30   486563304   1589756         1527973.
##  5 OWID_WRL <NA>      World    2022-03-29   484973548   1750295         1554157.
##  6 OWID_WRL <NA>      World    2022-03-28   483223253   1493115         1586314.
##  7 OWID_WRL <NA>      World    2022-03-27   481730138    912111         1583087.
##  8 OWID_WRL <NA>      World    2022-03-26   480818027   1356374         1605688.
##  9 OWID_WRL <NA>      World    2022-03-25   479461653   1842090         1656139.
## 10 OWID_WRL <NA>      World    2022-03-24   477619563   1752069         1666355.
## # ... with 175,976 more rows, and 60 more variables: total_deaths <dbl>,
## #   new_deaths <dbl>, new_deaths_smoothed <dbl>, total_cases_per_million <dbl>,
## #   new_cases_per_million <dbl>, new_cases_smoothed_per_million <dbl>,
## #   total_deaths_per_million <dbl>, new_deaths_per_million <dbl>,
## #   new_deaths_smoothed_per_million <dbl>, reproduction_rate <dbl>,
## #   icu_patients <dbl>, icu_patients_per_million <dbl>, hosp_patients <dbl>,
## #   hosp_patients_per_million <dbl>, weekly_icu_admissions <dbl>, ...

What are the metadata columns that describe our observations?

continent 
location  
date

Why do we have observations with the continent as NA?

# check which location have continent as NA
covid_data %>% filter(is.na(continent)) %>% count(location)
## # A tibble: 13 x 2
##    location                n
##    <chr>               <int>
##  1 Africa                780
##  2 Asia                  802
##  3 Europe                801
##  4 European Union        801
##  5 High income           802
##  6 International         786
##  7 Low income            770
##  8 Lower middle income   802
##  9 North America         802
## 10 Oceania               799
## 11 South America         771
## 12 Upper middle income   802
## 13 World                 802
Some rows contain summarised data of entire continents/World, we'll need to remove those

We can see that most of our data contains ‘0’ (check the difference between the median and the mean in total_cases and total_deaths columns). Just to confirm that, let’s plot a histogram of all the confirmed cases

ggplot(covid_data, aes(x=total_cases)) +
  geom_histogram(fill="lightskyblue") +
  theme_bw(def_text_size)

The data is evolving over days (a time-series), to there’s no point treating it as a random population sample.

Time-series plot

Let’s look at confirmed cases and total deaths data for the 10 most affected countries (to date). To find out these countries so we need to wrangle our data a little bit using the following steps:

  1. First we remove all observations for combined continents with filter(!is.na(continent)
  2. Then we group it by location with group_by()
  3. Then we sort it within each location by date (from latest to earliest) with arrange(desc(date))
  4. We select just the most recent data point from each location with slice(1) and remove grouping with ungroup()
  5. Next we arrange it by descending order of total cases and select the top 10 observations (one for each location)
  6. Finally, we subset our original data to contain just the countries from our vector with inner_join()

Optional step:

  1. We can recode the location variable as a factor and order it so the countries will be ordered in the legend by the number of cases with

Then we can look at the data as a table and make a plot with the number of cases in the y-axis and date in the x-axis.

# find the 10 most affected countries (to date)
latest_data <- covid_data %>% filter(!is.na(continent)) %>% 
  group_by(location) %>% arrange(desc(date)) %>% slice(1) %>% ungroup() 
most_affected_countries <- latest_data  %>%  
  arrange(desc(total_cases)) %>% slice(1:10) %>% 
  select(location)

# subset just the data from the 10 most affected countries and order them from the most affected to the least one
most_affected_data <- covid_data %>% 
  inner_join(most_affected_countries) %>% 
  mutate(Country=factor(location, levels = most_affected_countries$location))

# create a line plot the data of total deaths
ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)

It’s a bit hard to figure out how the pandemic evolved because the numbers in US, Brazil and India are an order of magnitude larger than the rest (which are very close to each other). How can we make it more visible (and also improve how of the dates appear in the x-axis)?

# better formatting of date axis, log scale 
plot <- ggplot(most_affected_data, aes(x=date, y=total_cases, colour=Country)) +
  geom_line(size=0.75) + scale_y_log10(labels=comma) + 
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total COVID-19 cases") +
  theme_bw(def_text_size)
# show an interactive plot
ggplotly(plot)

Why did we get a warning message and why the graphs don’t start at the bottom of the x-axis? How can we solve it? What can we infer from the graph (exponential increase)?

What happens when we take the log of 0?? Can we remove those 0s with the `filter()` function (or add a very small number to them)?
We can see a very similar trend for most countries and while the curve has flattened substantially in April last year, the numbers are still rising. It is also evident that Europe got hit by a second wave arount October last year and India in April this year.

Total deaths

Let’s have a look at the total deaths in these countries (and get rid of the minor grid lines to make Frank happy)

# create a line plot the data of total deaths
ggplot(most_affected_data, aes(x=date, y=total_deaths, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  labs(color="Country", y = "Total deaths") +
  theme_bw(def_text_size) +  
  theme(panel.grid.minor = element_blank()) # remove minor grid lines

Vaccination rates

Let’s have a look at the number of vaccinated people.

# vaccination rates
ggplot(most_affected_data, aes(x=date, y=people_vaccinated, colour=Country)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d("rcartocolor::Bold") +
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y="People Vaccinated") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

The graphs are “broken”, meaning that it is not continuous and we have some missing data.
Let’s visualise some of the variables in our data and assess “missingness”.

# visualise missingness
vis_dat(covid_data %>% filter(date>dmy("01-01-2021")) %>% 
          select(continent, location, total_cases, total_deaths, 
                 hosp_patients, people_vaccinated, people_fully_vaccinated))

# find which countries has the most number of observations (least missing data)
covid_data %>% filter(!is.na(continent), !is.na(people_vaccinated)) %>% # group_by(location) %>% 
  count(location) %>% arrange(desc(n)) %>% print(n=30)
## # A tibble: 219 x 2
##    location           n
##    <chr>          <int>
##  1 Norway           484
##  2 United States    473
##  3 Latvia           472
##  4 Denmark          471
##  5 Israel           468
##  6 Liechtenstein    465
##  7 Switzerland      465
##  8 Canada           462
##  9 Chile            461
## 10 Czechia          460
## 11 Estonia          460
## 12 Germany          460
## 13 Italy            460
## 14 Lithuania        460
## 15 Slovenia         460
## 16 France           459
## 17 Argentina        458
## 18 Belgium          457
## 19 Singapore        454
## 20 Ireland          452
## 21 United Kingdom   444
## 22 Greece           440
## 23 India            430
## 24 Brazil           428
## 25 Malta            427
## 26 Indonesia        425
## 27 Ecuador          423
## 28 Bolivia          419
## 29 Peru             415
## 30 Bahrain          412
## # ... with 189 more rows
covid_data %>% filter(!is.na(continent), !is.na(hosp_patients)) %>% # group_by(location) %>% 
  count(location) %>% arrange(desc(n)) %>% print(n=30)
## # A tibble: 38 x 2
##    location           n
##    <chr>          <int>
##  1 Italy            769
##  2 Estonia          765
##  3 Netherlands      764
##  4 Israel           762
##  5 Canada           753
##  6 Czechia          753
##  7 Hungary          753
##  8 Slovakia         752
##  9 Cyprus           749
## 10 Ireland          748
## 11 Slovenia         748
## 12 Belgium          747
## 13 France           746
## 14 Sweden           744
## 15 Luxembourg       742
## 16 Portugal         741
## 17 Malaysia         740
## 18 United Kingdom   735
## 19 Australia        733
## 20 Austria          732
## 21 Serbia           732
## 22 Switzerland      732
## 23 Denmark          730
## 24 Latvia           727
## 25 Bulgaria         721
## 26 Croatia          712
## 27 Poland           702
## 28 Malta            690
## 29 Norway           642
## 30 United States    626
## # ... with 8 more rows

Hospitalisation and Vaccination rates

Now we can focus on a subset of countries that have more complete vaccination and hospitalisation rates data, so we could compare them.

countries_subset <- c("Italy", "United States", "Israel", "United Kingdom",  "France", "Czechia")
# subset our original data to these countries
hosp_data <- covid_data %>% filter(location %in% countries_subset)
# define a new colour palette for these countries
col_pal <- "ggsci::category10_d3"

Let’s look at hospitalisation rates first.

ggplot(hosp_data, aes(x=date, y = hosp_patients,colour=location)) +
  geom_line(size=0.75) + scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

Can you identify the “waves” in each country?
It’s hard to see the details in the countries with lower number of hospitalised patients, how can we improve the visualisation?

Look at hospitalision rates proportional to the population size!
# hosp per population size
p1 <- ggplot(hosp_data, 
       aes(x=date, y = hosp_patients_per_million,colour=location)) +
  geom_line(size=0.75) + 
  scale_y_continuous(labels=comma) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Hospitalised patients (per million)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p1

Now let’s try to compare this to vaccination rates

# total vaccination per population
p2 <- ggplot(hosp_data, 
             aes(x=date, y = people_fully_vaccinated_per_hundred/100 ,colour=location)) +
  geom_line(size=0.75, linetype="dashed") + 
  guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
  scale_y_continuous(labels=percent) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", y = "Fully vaccinated (percent)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())
p2

What will be the best way to compare these values? Maybe like this:

(p1 + guides(color=FALSE)) / (p2 + theme(legend.position = "bottom")) #+ plot_layout(guides = 'collect')# show graphs on top of each other

Any other suggestions? Let’s try to present them on the same graph (note the trick with the secondary y-axis).

# show on the same graph
p4 <- ggplot(hosp_data, 
       aes(x=date, colour=location)) +
  geom_line(aes(y = hosp_patients_per_million), size=0.75) + 
  geom_line(aes(y = people_fully_vaccinated_per_hundred*10), size=0.75, linetype="dashed") + 
  scale_y_continuous(labels=comma, name = "Hospitalised patients per million (solid)",
                     # Add a second axis and specify its features
                     sec.axis = sec_axis(trans=~./10,  name="Fully vaccinated per hundred (dashed)")) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank()) 
p4

Probably best to present them on the same graph, but for each country separately (done with the facet_wrap() function).

# show on the same graph, but separate each country
p4 + 
  scale_x_date(NULL,
               breaks = breaks_width("4 months"), 
               labels = label_date_short()) +
  facet_wrap(~location)

Save the plot to a file.

# create output folder
dir.create("./output", showWarnings = FALSE)
# save the plot to pdf file
ggsave("output/hospit_vacc_rates_facet_country.pdf", width=14, height = 8)
# subset our original data to our countries + Europe (European average)
mort_data <- covid_data %>% 
  filter(location %in% c(countries_subset, "Europe")) %>% 
  select(date, location, new_deaths_smoothed_per_million)
# Get the EU new death rates
EU_mortality_data <- covid_data %>% filter(location=="Europe") %>% 
  select(date, EU_dpm=new_deaths_smoothed_per_million)

# create a new calculated column with the difference from the EU average
comp_mort_data <- mort_data %>% left_join(EU_mortality_data) %>% 
  mutate(new_mort_change=new_deaths_smoothed_per_million-EU_dpm,
         location=fct_relevel(factor(location), "Europe", after = Inf)) 

mort_plot <- ggplot(comp_mort_data, 
             aes(x=date, y = new_mort_change ,colour=location)) +
  geom_line(size=0.75) + 
  guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
  # scale_y_continuous(labels=percent) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("2 months"), 
               labels = label_date_short()) + 
  labs(color="Country", 
       y = "New COVID-related deaths (in comparison to the EU average)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank())

ggplotly(mort_plot)

Save the plot to a file.

# save the plot to pdf file
ggsave("output/mortality_rates_vs_EU_average.pdf", width=12, height = 8)

Plot for each country separately:

ggplot(comp_mort_data %>% filter(location!="Europe"), 
             aes(x=date, y = new_mort_change ,colour=location)) +
  geom_line(size=0.75) + 
  guides(color = guide_legend(override.aes = list(linetype="solid") ) ) +
  # scale_y_continuous(labels=percent) + 
  scale_color_paletteer_d(col_pal) +
  scale_x_date(name = NULL,
               breaks = breaks_width("4 months"), 
               labels = label_date_short()) + 
  labs(color="Country", 
       y = "New COVID-related deaths (in comparison to the EU average)") +
  theme_bw(def_text_size) + 
  theme(panel.grid.minor = element_blank()) + 
  facet_wrap(~location, ncol = 2) +
  geom_hline(yintercept = 0)

# save the plot to pdf file
ggsave("output/mortality_rates_vs_EU_average_facets.pdf", width=11, height = 8)

Questions

  1. What other variables we could analyse?
  2. Any correlated variables?
  3. What we should take into account that might bias the results or the true status of the pandemic?
1. Mortalities (Case Fatality Rate)?  
2. Suggestions? (cases per population density, vaccination rates by country income, deaths by number of beds per capita, etc.
3. Level of reporting in each country...  

Additional Resources

Now take the plunge and start practicing with your own data!!

tweet_embed("https://twitter.com/YannisMarkonis/status/1276211134186602499")

Contact

Please contact me at i.bar@griffith.edu.au for any questions or comments.

References

Mathieu E, Ritchie H, Ortiz-Ospina E, et al. (2021) A global database of COVID-19 vaccinations. Nat Hum Behav 5:947–953. doi: 10.1038/s41562-021-01122-8